The importance of a feature in a machine learning model can change significantly when you apply a non-linear function to the model's output. The most common case where this matters is a "squashing" function. Squashing functions such as the logistic transform are often used to convert an unbounded "margin" space into a bounded probability space. Values in the margin space are in units of information (log-odds), while values in the probability space are in units of probability. Which space you care about depends on the situation. The margin space is better for adding and subtracting, and corresponds directly to "evidence" in an information-theoretic sense. However, if you only care about changes in percentage probability, not evidence, then you are better off using the probability space. By choosing the probability space you are saying that a large amount of evidence that takes you from 98% probability to 99.99% probability is not nearly as important as a smaller amount of evidence that takes you from 50% probability to 60% probability. Why does it take more evidence to go from 98% to 99.99% than from 50% to 60%? Because evidence is measured in log-odds: moving from 98% to 99.99% is a change of about 5.3 in log-odds (ln 9999 - ln 49), while moving from 50% to 60% is a change of only about 0.41 (ln 1.5 - ln 1).
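As a quick check of the comparison above, the following sketch measures both moves in log-odds using `scipy.special.logit`. The helper name `evidence_needed` is made up for this illustration, and the natural-log units (nats) are an assumption that matches the logistic transform.

```python
from scipy.special import logit


def evidence_needed(p_start, p_end):
    """Change in log-odds (in nats) needed to move from p_start to p_end."""
    return logit(p_end) - logit(p_start)


high = evidence_needed(0.98, 0.9999)  # already confident, becoming near-certain
low = evidence_needed(0.50, 0.60)     # undecided, becoming slightly confident

print(f"98% -> 99.99%: {high:.2f} nats")  # 5.32
print(f"50% -> 60%:    {low:.2f} nats")   # 0.41
```

The first move needs roughly thirteen times as much evidence as the second, even though it changes the probability by only about two percentage points.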
Note that even though the logistic function is a monotonic transformation, it can still change the ordering of which features are most important in a model. The ordering can change because some features may be very important for getting to 99.9% probability, while others mostly help in getting to 60% probability. The simple example below shows how you can change the importance of a feature using a squashing function:
In [3]:
import numpy as np
import xgboost
import scipy.special  # expit lives in the scipy.special submodule, so import it explicitly
import shap
import pandas as pd
In [4]:
shap.initjs()
In [5]:
# build a simple dataset
N = 500
M = 4
X = np.random.randn(N, M)
# pin the first sample to A = 0 and B = 0 so it sits below both thresholds used in f
X[0,0] = 0
X[0,1] = 0
X = pd.DataFrame(X, columns=["A", "B", "C", "D"])
# a function (a made up ML model) with an output in "margin" space...
f = lambda X: (X[:,0] > 0) * 1 + (X[:,1] > 1.5) * 100
# ...and then also change its output to probability space
f_logistic = lambda X: scipy.special.expit(f(X))
In [7]:
# explain both functions
explainer = shap.KernelExplainer(f, X)
shap_values_f = explainer.shap_values(X.values[0:2,:])
explainer_logistic = shap.KernelExplainer(f_logistic, X)
shap_values_f_logistic = explainer_logistic.shap_values(X.values[0:2,:])
In [8]:
shap_values_f[0,:]
Out[8]:
In [9]:
shap.force_plot(float(explainer.expected_value), shap_values_f[0,:], X.iloc[0,:])
Out[9]:
In [10]:
shap_values_f_logistic[0,:]
Out[10]:
In [11]:
shap.force_plot(float(explainer_logistic.expected_value), shap_values_f_logistic[0,:], X.iloc[0,:])
Out[11]:
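To see where the reordering comes from without relying on a sampling-based estimate, the sketch below computes exact Shapley values for the same toy function by enumerating every feature subset. The helper `exact_shapley`, the fixed seed, and the choice to average over the background data for the "missing" features are assumptions made for this illustration; it is not meant to describe how `shap.KernelExplainer` works internally, although with only four features its results should land close to these.

```python
from itertools import combinations
from math import factorial

import numpy as np
from scipy.special import expit


def exact_shapley(model, x, background):
    """Exact Shapley values for one sample x by enumerating all feature subsets.

    Features outside a subset are filled in from the background rows.
    """
    M = x.shape[0]

    def value(subset):
        # Fix the features in `subset` to their values in x, average over the rest.
        data = background.copy()
        idx = list(subset)
        if idx:
            data[:, idx] = x[idx]
        return model(data).mean()

    phi = np.zeros(M)
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Same toy setup as above, with a fixed seed (an assumption for reproducibility).
rng = np.random.default_rng(0)
background = rng.standard_normal((500, 4))
background[0, :2] = 0.0
x = background[0]


def margin(X):
    return (X[:, 0] > 0) * 1 + (X[:, 1] > 1.5) * 100


def prob(X):
    return expit(margin(X))


phi_margin = exact_shapley(margin, x, background)
phi_prob = exact_shapley(prob, x, background)

print("margin space      [A, B, C, D]:", np.round(phi_margin, 3))
print("probability space [A, B, C, D]:", np.round(phi_prob, 3))
```

In margin space, B dominates because its jump of 100 dwarfs A's jump of 1. After squashing, `expit(100)` and `expit(101)` are both essentially 1, so B's jump is worth at most 0.5 in probability and applies to only a small fraction of the background rows, while A's smaller jump applies to about half of them. For this sample that is enough to make A the larger attribution in probability space.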